{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一般从第三项开始，有详细的就往前走，没有详细的就往后走  \n",
    "\n",
    "1. 完成详细页面单个元素的获取，比如一条评论，一张图片；get_one_picture()\n",
    "2. 完成详细页面单类元素的抽取，比如一篇新闻的评论，比如一个页面的所有图片；get_pictures()\n",
    "3. 完成详细页面所有内容的获取和解析，解析大部分都是得到，如果需要评论和图片则直接调用评论函数；get_detail()\n",
    "4. 完成首页所有内容的获取；parse_list_links()\n",
    "5. 完成多页内容的获取，使用 for 循环  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用浏览器的检查功能，定位自己要爬取的内容\n",
    "\n",
    "得到 response 以后，一般要 print(r.text) 看一下  \n",
    "\n",
    "多用 print()  \n",
    "\n",
    "soup.select() 一点一点增加功能，一步一步打印\n",
    "\n",
    "select() 可以选择标签、.class 和 #id，优先选择 .class\n",
    "\n",
    "可以级联选择 soup.select('.date-source span a')\n",
    "\n",
    "\\[0] 去掉列表外面的 []，.text 获取文本\n",
    "\n",
    "可以获取属性 link[\"href\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第一节  \n",
    "\n",
    "抽取、转换、存储 ETL(Extract Transformation Loading) 把非结构化的数据转化为结构化的数据  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第二节  \n",
    "\n",
    "文字资料大部分都在 Doc 下面，选择不同的类型，才更好找。比如 Img，比如 Media    \n",
    "\n",
    "可以点击 Response 看看所找的这个 URL 下面的内容是不是自己要找的  \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第三节  \n",
    "\n",
    "右键检查 -> Network -> Doc -> china/ -> 看 response  \n",
    "\n",
    "90% 的情况下，打开 Doc 第一个链接都可以把新闻爬取下来  \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第五节  \n",
    "\n",
    "获取 HTML 文件后，使用 DOM tree 的节点提取内容。DOM（Document Object Model）  \n",
    "\n",
    "最上面是 HTML，然后是 body，然后是 h1、a 标签等等  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第七节  \n",
    "\n",
    "爬虫两个有力的工具，requests 和 BeautifulSoup  \n",
    "\n",
    "使用浏览器的检查功能，定位自己要爬取的内容  \n",
    "\n",
    "选中的时候，看最下面一行，有层级位置  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第八节\n",
    "\n",
    "首页"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests \n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "url = \"https://news.sina.com.cn/china/\" \n",
    "r = requests.get(url) \n",
    "r.encoding = \"utf-8\" \n",
    "soup = BeautifulSoup(r.text, \"html.parser\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "视频写法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for news in soup.select(\".news-item\"):\n",
    "    if len(news.select('h2')) > 0:    # 有些标题为空  \n",
    "        h2 = news.select('h2')[0].text    # 标题\n",
    "        a = news.select('a')[0]['href']    # 链接  \n",
    "        print(h2, a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "网站有了变化，这是自己的写法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://news.sina.com.cn/c/2021-06-08/doc-ikqciyzi8487280.shtml\n",
      "https://news.sina.com.cn/c/2021-06-08/doc-ikqciyzi8486907.shtml\n",
      "https://news.sina.com.cn/c/2021-06-08/doc-ikqcfnaz9883230.shtml\n",
      "https://news.sina.com.cn/o/2021-06-08/doc-ikqcfnaz9875189.shtml\n",
      "https://news.sina.com.cn/c/2021-06-08/doc-ikqcfnaz9862135.shtml\n",
      "https://news.sina.com.cn/o/2021-06-08/doc-ikqcfnaz9856736.shtml\n",
      "https://news.sina.com.cn/c/2021-06-08/doc-ikqcfnaz9856489.shtml\n",
      "https://news.sina.com.cn/c/2021-06-08/doc-ikqcfnaz9851289.shtml\n",
      "https://news.sina.com.cn/c/2021-06-08/doc-ikqciyzi8450653.shtml\n",
      "https://news.sina.com.cn/o/2021-06-08/doc-ikqcfnaz9850851.shtml\n",
      "https://news.sina.com.cn/o/2021-06-08/doc-ikqciyzi8448924.shtml\n",
      "https://news.sina.com.cn/c/2021-06-08/doc-ikqciyzi8446863.shtml\n",
      "https://news.sina.com.cn/o/2021-06-08/doc-ikqcfnaz9845516.shtml\n",
      "https://news.sina.com.cn/o/2021-06-08/doc-ikqcfnaz9847036.shtml\n",
      "https://news.sina.com.cn/c/2021-06-08/doc-ikqcfnaz9840614.shtml\n"
     ]
    }
   ],
   "source": [
    "for news in soup.select('.right-content li a'):\n",
    "    title = news.text\n",
    "    link = news['href'] \n",
    "    print(link)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "具体新闻页"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests \n",
    "from bs4 import BeautifulSoup \n",
    "\n",
    "news_detail_url = 'https://news.sina.com.cn/c/2021-06-08/doc-ikqcfnaz9840614.shtml'\n",
    "r = requests.get(news_detail_url) \n",
    "r.encoding = 'utf-8' \n",
    "soup = BeautifulSoup(r.text, 'html.parser') "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "获取文章标题  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'注意！这些地方，高考查分时间公布'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.select('.main-title')[0].text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "获取文章发布时间"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'str'>\n",
      "2021年06月08日 15:09\n"
     ]
    }
   ],
   "source": [
    "# time_source = soup.select('.time-source')[0].contents[0].strip()\n",
    "time_source = soup.select('.date-source span')[0].text\n",
    "print(type(time_source)) \n",
    "print(time_source)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime \n",
    "\n",
    "dt = datetime.strptime(time_source, '%Y年%m月%d日 %H:%M')    # 这个格式是上面 timesource 的格式，是复制下来替换的  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'2021-06-08'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dt.strftime('%Y-%m-%d')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "获取文章来源"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'央视'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.select('.date-source a')[0].text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "获取文章内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<p>　　原标题：注意！这些地方，高考查分时间公布→</p>,\n",
       " <p>　　2021年全国统一高考今天进入第二天，按照考试安排，部分省份的考生，今天在完成文综、理综，以及外语科目的考试后，将结束高考。</p>,\n",
       " <p>　　今年第三批高考综合改革省份迎来“新高考”，目前，全国实施新高考的省份达到了14个。实施新高考的省份，由于考试科目安排的不同，大部分在9日到10日间仍有部分科目的考试。今年新增的8个高考综合改革省份的考生，将在明天完成高中学业水平选考科目的考试后结束高考。北京、天津、山东、海南等省份的考生将在10日结束考试。目前，安徽、河南、湖南等地已经公布了高考查分时间，其中安徽预计为6月23日，湖南、河南大致为6月25日。</p>]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.select('.article p')[:-2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'原标题：注意！这些地方，高考查分时间公布→ 2021年全国统一高考今天进入第二天，按照考试安排，部分省份的考生，今天在完成文综、理综，以及外语科目的考试后，将结束高考。 今年第三批高考综合改革省份迎来“新高考”，目前，全国实施新高考的省份达到了14个。实施新高考的省份，由于考试科目安排的不同，大部分在9日到10日间仍有部分科目的考试。今年新增的8个高考综合改革省份的考生，将在明天完成高中学业水平选考科目的考试后结束高考。北京、天津、山东、海南等省份的考生将在10日结束考试。目前，安徽、河南、湖南等地已经公布了高考查分时间，其中安徽预计为6月23日，湖南、河南大致为6月25日。'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "article = []\n",
    "for p in soup.select('.article p')[:-2]:  #  -2 是 p 标签最后一个是编辑名字，倒数第二个是字体，这里不需要，所以切片去掉\n",
    "    article.append(p.text.strip())    # strip() 是因为，会有空白的符号，首行两个空格，显示就是 \\u3000\\u3000\n",
    "' '.join(article)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "或者使用列表推导式实现同样的效果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'原标题：注意！这些地方，高考查分时间公布→ 2021年全国统一高考今天进入第二天，按照考试安排，部分省份的考生，今天在完成文综、理综，以及外语科目的考试后，将结束高考。 今年第三批高考综合改革省份迎来“新高考”，目前，全国实施新高考的省份达到了14个。实施新高考的省份，由于考试科目安排的不同，大部分在9日到10日间仍有部分科目的考试。今年新增的8个高考综合改革省份的考生，将在明天完成高中学业水平选考科目的考试后结束高考。北京、天津、山东、海南等省份的考生将在10日结束考试。目前，安徽、河南、湖南等地已经公布了高考查分时间，其中安徽预计为6月23日，湖南、河南大致为6月25日。'"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join([p.text.strip() for p in soup.select('.article p')[:-2]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "获取编辑的名字"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'责任编辑：张玉 '"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.select('.show_author')[0].text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'张玉 '"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.select('.show_author')[0].text.lstrip('责任编辑：')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第十五节\n",
    "\n",
    "获取评论数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<span class=\"num\" node-type=\"comment-num\">0</span>]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.select('.tool-cmt .num')    # 用浏览器检查功能获得的标签，里面没有我们要找的内容，是用 JavaScript 渲染的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import requests\n",
    "import json\n",
    "\n",
    "comments = requests.get('https://comment.sina.com.cn/page/info?version=1&format=json&channel=sh&newsid=comos-kqcfnaz9840614&group=0\\\n",
    "&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1&uid=unlogin_user&callback=jsonp_1623149626413&_=1623149626413')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "lstrip() 和 rstrip() 就是截取，和切片一样容易"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "jd = json.loads(comments.text.lstrip('jsonp_1623149626413(').rstrip(')'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "jd['result']['count']['total']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面是原来的写法，后面有一串去掉照样可以用，不去掉，每一篇文章要 lstrip() 的内容都不一样，没法写程序，去掉以后简单多了"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import json\n",
    "\n",
    "comments = requests.get('https://comment.sina.com.cn/page/info?version=1&format=json&channel=sh&newsid=comos-kqcfnaz9840614&group=0\\\n",
    "&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1&uid=unlogin_user')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "jd = json.loads(comments.text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "jd['result']['count']['total']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "获取网页链接中的新闻编号，获取评论的时候会用到文章编号"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['https:',\n",
       " '',\n",
       " 'news.sina.com.cn',\n",
       " 'c',\n",
       " '2021-06-07',\n",
       " 'doc-ikqciyzi8287738.shtml']"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "news_url = 'https://news.sina.com.cn/c/2021-06-07/doc-ikqciyzi8287738.shtml' \n",
    "news_url.split('/')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'doc-ikqciyzi8287738.shtml'"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "news_url.split('/')[-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'kqciyzi8287738'"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "news_id = news_url.split('/')[-1].rstrip('.shtml').lstrip('doc-i')\n",
    "news_id"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "或者使用正则表达式"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_sre.SRE_Match object; span=(38, 63), match='doc-ikqciyzi8287738.shtml'>\n"
     ]
    }
   ],
   "source": [
    "import re \n",
    "\n",
    "m = re.search('doc-i(.*).shtml', news_url)\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'doc-ikqciyzi8287738.shtml'"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'kqciyzi8287738'"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'kqciyzi8287738'"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "news_id = m.group(1)\n",
    "news_id"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第十七节\n",
    "\n",
    "建立评论抽取函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://comment.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-kqciyzi8287738&group=0&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1&uid=unlogin_user'"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "comment_url = 'https://comment.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-{}\\\n",
    "&group=0&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1&uid=unlogin_user'\n",
    "\n",
    "comment_url.format(news_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个是某个页面的单项元素"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json \n",
    "\n",
    "def get_comment_count(news_url):\n",
    "    m = re.search('doc-i(.*).shtml', news_url)\n",
    "    news_id = m.group(1)\n",
    "    comments = requests.get(comment_url.format(news_id))\n",
    "    jd = json.loads(comments.text)  \n",
    "    return jd['result']['count']['total']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1642"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "news = 'https://news.sina.com.cn/c/2021-06-07/doc-ikqciyzi8287738.shtml'\n",
    "\n",
    "get_comment_count(news)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第十八节\n",
    "\n",
    "完成新闻抽取函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个是单个页面的 detail，是核心"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests \n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def get_news_detail(news_url):\n",
    "    result = {} \n",
    "    r = requests.get(news_url) \n",
    "    r.encoding = 'utf-8'\n",
    "    soup = BeautifulSoup(r.text, 'html.parser')\n",
    "    result['title'] = soup.select('.main-title')[0].text\n",
    "    time_source = soup.select('.date-source span')[0].text\n",
    "    result['dt'] = datetime.strptime(time_source, '%Y年%m月%d日 %H:%M')\n",
    "    result['news_source'] = soup.select('.date-source a')[0].text\n",
    "    result['article'] = ' '.join([p.text.strip() for p in soup.select('.article p')[:-2]])\n",
    "    result['editor'] = soup.select('.show_author')[0].text.lstrip('责任编辑：')\n",
    "    result['comments'] = get_comment_count(news_url)\n",
    "    return result "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "news_url = 'https://news.sina.com.cn/c/2021-06-07/doc-ikqciyzi8287738.shtml'\n",
    "\n",
    "get_news_detail(news_url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第十九节\n",
    "\n",
    "从列表链接中获取每篇新闻内容\n",
    "\n",
    "先在有很多内容是使用 ajax 渲染加载的，所以先看 Doc，再看 XHR，再看 JS  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第二十节\n",
    "\n",
    "找到分页链接  \n",
    "\n",
    "分析可以看到是 page 参数控制分页  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个是新闻首页，获取这一页上所有的链接"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests \n",
    "from bs4 import BeautifulSoup \n",
    "import json\n",
    "\n",
    "r = requests.get('https://feed.sina.com.cn/api/roll/get?pageid=121&lid=1356&num=20&versionNumber=1.2.4&page=1&encode=utf-8&callback=feedCardJsonpCallback&_=1623103783456')\n",
    "r = '{'+ r.text.lstrip('try{feedCardJsonpCallback(').rstrip(');}catch(e){};') + '}}'   # 在 Preview 中右键，Open in new tab，观察结构\n",
    "jd = json.loads(r)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "for ent in jd['result']['data']:\n",
    "    print(ent['url'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {},
   "outputs": [],
   "source": [
    "def parse_list_links(url):\n",
    "    r = requests.get(url) \n",
    "#     jd = json.loads('{'+ r.text.lstrip('try{feedCardJsonpCallback(').rstrip(');}catch(e){};') + '}}')\n",
    "    jd = json.loads(r.text)\n",
    "    news_details = [] \n",
    "    for ent in jd['result']['data']:\n",
    "        news_details.append(get_news_detail(ent['url']))\n",
    "    return news_details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "url = 'https://feed.sina.com.cn/api/roll/get?pageid=121&lid=1356&num=20&versionNumber=1.2.4&page=1&encode=utf-8'\n",
    "parse_list_links(url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第二十四节  \n",
    "\n",
    "批次抓取每页新闻内容  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "url = 'https://feed.sina.com.cn/api/roll/get?pageid=121&lid=1356&num=20&versionNumber=1.2.4&page={}&encode=utf-8'\n",
    "news_total = [] \n",
    "for i in range(1, 3):    # 2 页\n",
    "    news_url = url.format(i)\n",
    "    news_array = parse_list_links(news_url)\n",
    "    news_total.extend(news_array)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "40"
      ]
     },
     "execution_count": 115,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(news_total)    # 一共多少篇文章"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第二十五节\n",
    "\n",
    "使用 pandas 整理数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>dt</th>\n",
       "      <th>news_source</th>\n",
       "      <th>article</th>\n",
       "      <th>editor</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>美台将重启贸易谈判？这事没那么简单……</td>\n",
       "      <td>2021-06-08 20:53:00</td>\n",
       "      <td>长安街知事</td>\n",
       "      <td>原标题：美台将重启贸易谈判？这事没那么简单…… 美国国务卿布林肯7日在众议院外交委员会的一场...</td>\n",
       "      <td>刘光博</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>无惧压力支持中国 匈牙利总理：重投一百次也是一样</td>\n",
       "      <td>2021-06-08 20:53:00</td>\n",
       "      <td>长安街知事</td>\n",
       "      <td>原标题：无惧压力支持中国，匈牙利总理：重投一百次也是一样 当地时间4日，匈牙利再次使用一票否...</td>\n",
       "      <td>刘光博</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>猪肉价格持续下跌 肉制品“走”向何方？</td>\n",
       "      <td>2021-06-08 20:43:00</td>\n",
       "      <td>新华网</td>\n",
       "      <td>原标题：猪肉价格持续下跌，肉制品“走”向何方？ 新华社上海6月8日电　题：猪肉价格持续下跌，...</td>\n",
       "      <td>张玉</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>广州：体育场馆、星级酒店等场所即日起实行扫码进场</td>\n",
       "      <td>2021-06-08 20:42:00</td>\n",
       "      <td>澎湃新闻</td>\n",
       "      <td>原标题：广州：体育场馆、星级酒店等场所即日起实行扫码进场 广州市文化广电旅游局网站6月8日消...</td>\n",
       "      <td>刘光博</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>同日被“双开”的两名厅官 同日被捕</td>\n",
       "      <td>2021-06-08 20:41:00</td>\n",
       "      <td>中国青年报</td>\n",
       "      <td>原标题：同日被“双开”的两名厅官，同日被捕 6月8日，最高检网站发布青海省两名厅官被捕的消息...</td>\n",
       "      <td>刘光博</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      title                  dt news_source  \\\n",
       "0       美台将重启贸易谈判？这事没那么简单…… 2021-06-08 20:53:00       长安街知事   \n",
       "1  无惧压力支持中国 匈牙利总理：重投一百次也是一样 2021-06-08 20:53:00       长安街知事   \n",
       "2       猪肉价格持续下跌 肉制品“走”向何方？ 2021-06-08 20:43:00         新华网   \n",
       "3  广州：体育场馆、星级酒店等场所即日起实行扫码进场 2021-06-08 20:42:00        澎湃新闻   \n",
       "4         同日被“双开”的两名厅官 同日被捕 2021-06-08 20:41:00       中国青年报   \n",
       "\n",
       "                                             article editor  \n",
       "0  原标题：美台将重启贸易谈判？这事没那么简单…… 美国国务卿布林肯7日在众议院外交委员会的一场...   刘光博   \n",
       "1  原标题：无惧压力支持中国，匈牙利总理：重投一百次也是一样 当地时间4日，匈牙利再次使用一票否...   刘光博   \n",
       "2  原标题：猪肉价格持续下跌，肉制品“走”向何方？ 新华社上海6月8日电　题：猪肉价格持续下跌，...    张玉   \n",
       "3  原标题：广州：体育场馆、星级酒店等场所即日起实行扫码进场 广州市文化广电旅游局网站6月8日消...   刘光博   \n",
       "4  原标题：同日被“双开”的两名厅官，同日被捕 6月8日，最高检网站发布青海省两名厅官被捕的消息...   刘光博   "
      ]
     },
     "execution_count": 116,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas\n",
    "\n",
    "df = pandas.DataFrame(news_total)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第二十六节\n",
    "\n",
    "保存数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "df.to_excel('news.xlsx')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sqlite3\n",
    "\n",
    "with sqlite3.connect('sina_news.sqlite') as db:\n",
    "    df.to_sql('news', db)   # news 是数据库表名  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sqlite3\n",
    "\n",
    "with sqlite3.connect('sina_news.sqlite') as db:\n",
    "    df2 = pandas.read_sql_query('SELECT * FROM news', con=db)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>title</th>\n",
       "      <th>dt</th>\n",
       "      <th>news_source</th>\n",
       "      <th>article</th>\n",
       "      <th>editor</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>美台将重启贸易谈判？这事没那么简单……</td>\n",
       "      <td>2021-06-08 20:53:00</td>\n",
       "      <td>长安街知事</td>\n",
       "      <td>原标题：美台将重启贸易谈判？这事没那么简单…… 美国国务卿布林肯7日在众议院外交委员会的一场...</td>\n",
       "      <td>刘光博</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>无惧压力支持中国 匈牙利总理：重投一百次也是一样</td>\n",
       "      <td>2021-06-08 20:53:00</td>\n",
       "      <td>长安街知事</td>\n",
       "      <td>原标题：无惧压力支持中国，匈牙利总理：重投一百次也是一样 当地时间4日，匈牙利再次使用一票否...</td>\n",
       "      <td>刘光博</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>猪肉价格持续下跌 肉制品“走”向何方？</td>\n",
       "      <td>2021-06-08 20:43:00</td>\n",
       "      <td>新华网</td>\n",
       "      <td>原标题：猪肉价格持续下跌，肉制品“走”向何方？ 新华社上海6月8日电　题：猪肉价格持续下跌，...</td>\n",
       "      <td>张玉</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>广州：体育场馆、星级酒店等场所即日起实行扫码进场</td>\n",
       "      <td>2021-06-08 20:42:00</td>\n",
       "      <td>澎湃新闻</td>\n",
       "      <td>原标题：广州：体育场馆、星级酒店等场所即日起实行扫码进场 广州市文化广电旅游局网站6月8日消...</td>\n",
       "      <td>刘光博</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>同日被“双开”的两名厅官 同日被捕</td>\n",
       "      <td>2021-06-08 20:41:00</td>\n",
       "      <td>中国青年报</td>\n",
       "      <td>原标题：同日被“双开”的两名厅官，同日被捕 6月8日，最高检网站发布青海省两名厅官被捕的消息...</td>\n",
       "      <td>刘光博</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   index                     title                   dt news_source  \\\n",
       "0      0       美台将重启贸易谈判？这事没那么简单……  2021-06-08 20:53:00       长安街知事   \n",
       "1      1  无惧压力支持中国 匈牙利总理：重投一百次也是一样  2021-06-08 20:53:00       长安街知事   \n",
       "2      2       猪肉价格持续下跌 肉制品“走”向何方？  2021-06-08 20:43:00         新华网   \n",
       "3      3  广州：体育场馆、星级酒店等场所即日起实行扫码进场  2021-06-08 20:42:00        澎湃新闻   \n",
       "4      4         同日被“双开”的两名厅官 同日被捕  2021-06-08 20:41:00       中国青年报   \n",
       "\n",
       "                                             article editor  \n",
       "0  原标题：美台将重启贸易谈判？这事没那么简单…… 美国国务卿布林肯7日在众议院外交委员会的一场...   刘光博   \n",
       "1  原标题：无惧压力支持中国，匈牙利总理：重投一百次也是一样 当地时间4日，匈牙利再次使用一票否...   刘光博   \n",
       "2  原标题：猪肉价格持续下跌，肉制品“走”向何方？ 新华社上海6月8日电　题：猪肉价格持续下跌，...    张玉   \n",
       "3  原标题：广州：体育场馆、星级酒店等场所即日起实行扫码进场 广州市文化广电旅游局网站6月8日消...   刘光博   \n",
       "4  原标题：同日被“双开”的两名厅官，同日被捕 6月8日，最高检网站发布青海省两名厅官被捕的消息...   刘光博   "
      ]
     },
     "execution_count": 122,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2.head()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
